Using a Natural Cubic Spline in Place of the Quadratic Polynomial
For simplicity, we have introduced linear models for quantitative outcomes using a quadratic polynomial to represent the effect of a single quantitative predictor. In practice, we would typically use a natural cubic spline instead of a polynomial.
Using the EFFECT statement and non-positional syntax, it is straightforward to modify our code to use a natural cubic spline with 4 df in place of the quadratic polynomial.
Initial Fit and Diagnostic Plots
First, plot the residuals vs the predicted values.
LIBNAME NHANES "../../../BMI552_Datasets/NHANES_May2023/";
*LIBNAME NHANES "~/my_shared_file_links/u48718377/NHANES";
OPTIONS FMTSEARCH=(NHANES.formats_NHANES work) NODATE NONUMBER;
TITLE1; TITLE2;
ODS NOPROCTITLE;
ODS GRAPHICS / LOESSMAXOBS=20000;
DATA Analysis;
SET NHANES.NHANES (RENAME=(Race1=Race));
RUN;
ODS EXCLUDE ALL;
PROC GLMSELECT DATA=Analysis;
ODS OUTPUT FitStatistics=F;
EFFECT AgeCS = Spline(Age /
NATURALCUBIC BASIS=TPF(NOINT)
KNOTMETHOD=PERCENTILELIST(5 27.5 50 72.5 95));
MODEL TotChol = AgeCS / SELECTION=NONE;
OUTPUT OUT=Fitted P=Pred R=Resid;
STORE SplineModel;
RUN;
ODS EXCLUDE NONE;
PROC PRINT DATA=Fitted (OBS=10);
VAR Age TotChol Pred Resid;
FORMAT Pred 6.2 Resid 6.2;
RUN;
PROC SGPLOT DATA=Fitted;
LOESS X=Pred Y=Resid / LINEATTRS=(COLOR="Red");
RUN;
| Obs | Age | TotChol | Pred | Resid |
|---|---|---|---|---|
| 1 | 39 | 4.94 | 5.13 | -0.19 |
| 2 | 37 | 6.36 | 5.07 | 1.29 |
| 3 | 34 | 7.24 | 4.97 | 2.27 |
| 4 | 40 | 6.28 | 5.16 | 1.12 |
| 5 | 52 | 3.1 | 5.35 | -2.25 |
| 6 | 58 | 4.16 | 5.33 | -1.17 |
| 7 | 61 | 4.86 | 5.29 | -0.43 |
| 8 | 79 | 4.01 | 4.88 | -0.87 |
| 9 | 22 | 3.59 | 4.58 | -0.99 |
| 10 | 37 | 4.11 | 5.07 | -0.96 |
Plot the square root of the absolute value of the residuals vs the fitted values.
DATA Fitted;
SET Fitted;
Sqrt_Abs_Resid = SQRT(ABS(Resid));
RUN;
PROC SGPLOT DATA=Fitted;
LOESS X=pred Y=sqrt_abs_resid;
REG X=Pred Y=Sqrt_Abs_Resid / NOMARKERS LINEATTRS=(COLOR="green");
RUN;
Plot the residuals and square root of the absolute value of the residuals against each predictor variable.
PROC SGPLOT DATA=Fitted;
LOESS X=Age Y=Resid / LINEATTRS=(COLOR="Red");
RUN;
PROC SGPLOT DATA=Fitted;
LOESS X=Age Y=Sqrt_Abs_Resid;
REG X=Age Y=Sqrt_Abs_Resid / NOMARKERS LINEATTRS=(COLOR="green");
RUN;
Examine a normal scores plot for the residuals.
PROC UNIVARIATE DATA=Fitted NOPRINT;
QQPLOT Resid / NORMAL(MU=EST SIGMA=EST);
RUN;
Formal Model Comparisons (AIC/BIC)
AIC or BIC can be used to compare different model specifications.
The following SAS code calculates AIC for models using natural cubic splines with 2-6 df and our original quadratic model.
ODS EXCLUDE ALL;
PROC GLMSELECT DATA=Analysis;
ODS OUTPUT FitStatistics=Quadratic;
EFFECT AgeCS = Polynomial(Age / DEGREE=2);
MODEL TotChol = AgeCS / SELECTION=NONE;
RUN;
DATA Quadratic;
LENGTH Model $30;
SET Quadratic;
IF Label1="AIC";
Model = "Quadratic";
AIC = nValue1;
KEEP Model AIC;
RUN;
PROC GLMSELECT DATA=Analysis;
ODS OUTPUT FitStatistics=DF2;
EFFECT AgeCS = Spline(Age /
NATURALCUBIC BASIS=TPF(NOINT)
KNOTMETHOD=PERCENTILELIST(10 50 90));
MODEL TotChol = AgeCS / SELECTION=NONE;
RUN;
DATA DF2;
LENGTH Model $30;
SET DF2;
IF Label1="AIC";
Model = "2 df";
AIC = nValue1;
KEEP Model AIC;
RUN;
PROC GLMSELECT DATA=Analysis;
ODS OUTPUT FitStatistics=DF3;
EFFECT AgeCS = Spline(Age /
NATURALCUBIC BASIS=TPF(NOINT)
KNOTMETHOD=PERCENTILELIST(5 35 65 95));
MODEL TotChol = AgeCS / SELECTION=NONE;
RUN;
DATA DF3;
LENGTH Model $30;
SET DF3;
IF Label1="AIC";
Model = "3 df";
AIC = nValue1;
KEEP Model AIC;
RUN;
PROC GLMSELECT DATA=Analysis;
ODS OUTPUT FitStatistics=DF4;
EFFECT AgeCS = Spline(Age /
NATURALCUBIC BASIS=TPF(NOINT)
KNOTMETHOD=PERCENTILELIST(5 27.5 50 72.5 95));
MODEL TotChol = AgeCS / SELECTION=NONE;
RUN;
DATA DF4;
LENGTH Model $30;
SET DF4;
IF Label1="AIC";
Model = "4 df";
AIC = nValue1;
KEEP Model AIC;
RUN;
PROC GLMSELECT DATA=Analysis;
ODS OUTPUT FitStatistics=DF5;
EFFECT AgeCS = Spline(Age /
NATURALCUBIC BASIS=TPF(NOINT)
KNOTMETHOD=PERCENTILELIST(5 23 41 59 77 95));
MODEL TotChol = AgeCS / SELECTION=NONE;
RUN;
DATA DF5;
LENGTH Model $30;
SET DF5;
IF Label1="AIC";
Model = "5 df";
AIC = nValue1;
KEEP Model AIC;
RUN;
PROC GLMSELECT DATA=Analysis;
ODS OUTPUT FitStatistics=DF6;
EFFECT AgeCS = Spline(Age /
NATURALCUBIC BASIS=TPF(NOINT)
KNOTMETHOD=PERCENTILELIST(2.5 18.33 34.17 50 65.83 81.67 97.5));
MODEL TotChol = AgeCS / SELECTION=NONE;
RUN;
DATA DF6;
LENGTH Model $30;
SET DF6;
IF Label1="AIC";
Model = "6 df";
AIC = nValue1;
KEEP Model AIC;
RUN;
DATA ModelComparisons;
SET Quadratic DF2 DF3 DF4 DF5 DF6;
RUN;
PROC MEANS DATA=ModelComparisons MAX NOPRINT;
VAR AIC;
OUTPUT OUT=MinIC MIN= / AUTONAME;
RUN;
DATA ModelComparisons;
IF _N_=1 THEN SET MinIC;
SET ModelComparisons;
AIC = AIC - AIC_Min;
KEEP Model AIC;
RUN;
ODS EXCLUDE NONE;
TITLE "Model Comparisons (AIC)";
PROC PRINT DATA=ModelComparisons NOOBS;
VAR Model AIC;
FORMAT AIC 6.1;
RUN;
TITLE;
| Model | AIC |
|---|---|
| Quadratic | 3.6 |
| 2 df | 1.2 |
| 3 df | 0.0 |
| 4 df | 1.5 |
| 5 df | 3.4 |
| 6 df | 4.9 |
Based on AIC, the optimal model is the natural cubic spline with 3 df.
The following SAS code calculates BIC for models using natural cubic splines with 2-6 df.
ODS EXCLUDE ALL;
PROC GLMSELECT DATA=Analysis;
ODS OUTPUT FitStatistics=Quadratic;
EFFECT AgeCS = Polynomial(Age / DEGREE=2);
MODEL TotChol = AgeCS / SELECTION=NONE;
RUN;
DATA Quadratic;
LENGTH Model $30;
SET Quadratic;
IF Label1="SBC";
Model = "Quadratic";
BIC = nValue1;
KEEP Model BIC;
RUN;
PROC GLMSELECT DATA=Analysis;
ODS OUTPUT FitStatistics=DF2;
EFFECT AgeCS = Spline(Age /
NATURALCUBIC BASIS=TPF(NOINT)
KNOTMETHOD=PERCENTILELIST(10 50 90));
MODEL TotChol = AgeCS / SELECTION=NONE;
RUN;
DATA DF2;
LENGTH Model $30;
SET DF2;
IF Label1="SBC";
Model = "2 df";
BIC = nValue1;
KEEP Model BIC;
RUN;
PROC GLMSELECT DATA=Analysis;
ODS OUTPUT FitStatistics=DF3;
EFFECT AgeCS = Spline(Age /
NATURALCUBIC BASIS=TPF(NOINT)
KNOTMETHOD=PERCENTILELIST(5 35 65 95));
MODEL TotChol = AgeCS / SELECTION=NONE;
RUN;
DATA DF3;
LENGTH Model $30;
SET DF3;
IF Label1="SBC";
Model = "3 df";
BIC = nValue1;
KEEP Model BIC;
RUN;
PROC GLMSELECT DATA=Analysis;
ODS OUTPUT FitStatistics=DF4;
EFFECT AgeCS = Spline(Age /
NATURALCUBIC BASIS=TPF(NOINT)
KNOTMETHOD=PERCENTILELIST(5 27.5 50 72.5 95));
MODEL TotChol = AgeCS / SELECTION=NONE;
RUN;
DATA DF4;
LENGTH Model $30;
SET DF4;
IF Label1="SBC";
Model = "4 df";
BIC = nValue1;
KEEP Model BIC;
RUN;
PROC GLMSELECT DATA=Analysis;
ODS OUTPUT FitStatistics=DF5;
EFFECT AgeCS = Spline(Age /
NATURALCUBIC BASIS=TPF(NOINT)
KNOTMETHOD=PERCENTILELIST(5 23 41 59 77 95));
MODEL TotChol = AgeCS / SELECTION=NONE;
RUN;
DATA DF5;
LENGTH Model $30;
SET DF5;
IF Label1="SBC";
Model = "5 df";
BIC = nValue1;
KEEP Model BIC;
RUN;
PROC GLMSELECT DATA=Analysis;
ODS OUTPUT FitStatistics=DF6;
EFFECT AgeCS = Spline(Age /
NATURALCUBIC BASIS=TPF(NOINT)
KNOTMETHOD=PERCENTILELIST(2.5 18.33 34.17 50 65.83 81.67 97.5));
MODEL TotChol = AgeCS / SELECTION=NONE;
RUN;
DATA DF6;
LENGTH Model $30;
SET DF6;
IF Label1="SBC";
Model = "6 df";
BIC = nValue1;
KEEP Model BIC;
RUN;
DATA ModelComparisons;
SET Quadratic DF2 DF3 DF4 DF5 DF6;
RUN;
PROC MEANS DATA=ModelComparisons MAX NOPRINT;
VAR BIC;
OUTPUT OUT=MinIC MIN= / AUTONAME;
RUN;
DATA ModelComparisons;
IF _N_=1 THEN SET MinIC;
SET ModelComparisons;
BIC = BIC - BIC_Min;
KEEP Model BIC;
RUN;
ODS EXCLUDE NONE;
TITLE "Model Comparisons (BIC)";
PROC PRINT DATA=ModelComparisons NOOBS;
VAR Model BIC;
FORMAT BIC 6.1;
RUN;
TITLE;
| Model | BIC |
|---|---|
| Quadratic | 2.4 |
| 2 df | 0.0 |
| 3 df | 5.7 |
| 4 df | 14.0 |
| 5 df | 22.8 |
| 6 df | 31.2 |
Based on BIC, the optimal model is the natural cubic spline with 2 df, which fits decisively better than the natural cubic spline with 4 df.
Note that we would not generally simplify our model based on these types of comparisons.
Inference
Given a satisfactory model, we can obtain compatability intervals for means and differences in means using ESTIMATE statements.
ODS EXCLUDE ALL;
PROC PLM RESTORE=SplineModel;
ESTIMATE "Age 35" Intercept 1 AgeCS [1, 35] / CL ALPHA=0.03125;
ESTIMATE "Age 45" Intercept 1 AgeCS [1, 45] / CL ALPHA=0.03125;
ESTIMATE "Age 55" Intercept 1 AgeCS [1, 55] / CL ALPHA=0.03125;
ESTIMATE "Age 65" Intercept 1 AgeCS [1, 65] / CL ALPHA=0.03125;
ODS OUTPUT Estimates=E;
RUN;
ODS EXCLUDE NONE;
PROC PRINT DATA=E NOOBS LABEL;
VAR Label Lower Estimate Upper;
LABEL Label='00'x;
FORMAT Estimate 6.3 Lower 6.3 Upper 6.3;
RUN;
ODS EXCLUDE ALL;
PROC PLM RESTORE=SplineModel;
ESTIMATE "Age 45 v 35" AgeCS [1, 45] [-1, 35] / CL ALPHA=0.03125;
ESTIMATE "Age 55 v 35" AgeCS [1, 55] [-1, 35] / CL ALPHA=0.03125;
ESTIMATE "Age 55 v 45" AgeCS [1, 55] [-1, 45] / CL ALPHA=0.03125;
ESTIMATE "Age 65 v 35" AgeCS [1, 65] [-1, 35] / CL ALPHA=0.03125;
ESTIMATE "Age 65 v 45" AgeCS [1, 65] [-1, 45] / CL ALPHA=0.03125;
ESTIMATE "Age 65 v 55" AgeCS [1, 65] [-1, 55] / CL ALPHA=0.03125;
ODS OUTPUT Estimates=E;
RUN;
ODS EXCLUDE NONE;
PROC PRINT DATA=E NOOBS LABEL;
VAR Label Lower Estimate Upper;
LABEL Label='00'x;
FORMAT Estimate 6.3 Lower 6.3 Upper 6.3;
RUN;
| Lower | Estimate | Upper | |
|---|---|---|---|
| Age 35 | 4.947 | 5.004 | 5.061 |
| Age 45 | 5.224 | 5.278 | 5.333 |
| Age 55 | 5.303 | 5.349 | 5.395 |
| Age 65 | 5.171 | 5.227 | 5.284 |
| Lower | Estimate | Upper | |
|---|---|---|---|
| Age 45 v 35 | 0.199 | 0.274 | 0.350 |
| Age 55 v 35 | 0.269 | 0.345 | 0.421 |
| Age 55 v 45 | 0.014 | 0.071 | 0.128 |
| Age 65 v 35 | 0.146 | 0.223 | 0.300 |
| Age 65 v 45 | -0.141 | -0.051 | 0.039 |
| Age 65 v 55 | -0.164 | -0.122 | -0.080 |
The estimated mean serum cholesterol levels are \((4.947,5.061)\) mmol/L for 35 year olds, \((5.224,5.333)\) mmol/L for 45 year olds, \((0.199,0.350)\) mmol/L greater than 35 year olds; \((5.303,5.395)\) mmol/L for 55 year olds, \((0.269,0.421)\) mmol/L greater than 35 year olds and \((0.014,0.128)\) mmol/L greater than 45 year olds; and \((5.171,5.284)\) mmol/L for 65 year olds, \((0.146,0.300)\) mmol/L greater than 35 year olds, 0.141 mmol/L lower to 0.039 mmol/L greater than 45 year olds, and \((0.080,0.164)\) mmol/L lower than 55 year olds.
We can perform F-tests for the overall age effect and for linearity of the age effect.
ODS EXCLUDE ALL;
PROC GLMSELECT DATA=Analysis;
EFFECT AgeCS = Spline(Age /
NATURALCUBIC BASIS=TPF(NOINT)
KNOTMETHOD=PERCENTILELIST(5 27.5 50 72.5 95));
MODEL TotChol = AgeCS / SELECTION=NONE;
ODS OUTPUT ANOVA=Full;
RUN;
DATA Full (KEEP=SS DF RENAME=(SS=SS_Full DF=DF_Full));
SET Full;
WHERE Source="Error";
RUN;
PROC GLMSELECT DATA=Analysis;
EFFECT AgeCS = Spline(Age /
NATURALCUBIC BASIS=TPF(NOINT)
KNOTMETHOD=PERCENTILELIST(5 27.5 50 72.5 95));
MODEL TotChol = Age / SELECTION=NONE;
ODS OUTPUT ANOVA=Linear;
RUN;
DATA Linear (KEEP=Test DF SS);
SET Linear;
WHERE Source="Error";
Test="Non-Linear Age";
RUN;
PROC GLMSELECT DATA=Analysis;
EFFECT AgeCS = Spline(Age /
NATURALCUBIC BASIS=TPF(NOINT)
KNOTMETHOD=PERCENTILELIST(5 27.5 50 72.5 95));
MODEL TotChol = / SELECTION=NONE;
ODS OUTPUT ANOVA=Null;
RUN;
DATA Null (KEEP=Test DF SS);
SET Null;
WHERE Source="Error";
Test="Overall Age";
RUN;
DATA FTests;
LENGTH Test $50;
IF _N_ = 1 THEN SET Full;
SET Linear Null;
SS = SS-SS_Full;
DF = DF-DF_Full;
MS = SS/DF;
MS_Full = SS_Full/DF_Full;
F = MS/MS_Full;
ProbF = EXP(LOGSDF("F",F,DF,DF_Full));
sValue = -LOG(ProbF)/LOG(2);
KEEP Test DF MS F ProbF sValue;
RUN;
ODS EXCLUDE NONE;
TITLE "F-Tests";
PROC PRINT DATA=FTests NOOBS;
VAR Test DF MS F ProbF sValue;
FORMAT sValue 6.2;
FORMAT ProbF PVALUE6.4;
RUN;
TITLE;
| Test | DF | MS | F | ProbF | sValue |
|---|---|---|---|---|---|
| Non-Linear Age | 3 | 112.921 | 106.477 | <.0001 | 221.67 |
| Overall Age | 4 | 115.441 | 108.853 | <.0001 | 297.30 |
Prediction
The following graphic and table display prediction intervals (lighter red, LCL-UCL) for (individual) total cholesterol by age and confidence intervals (darker red, LCLM-UCLM) for the mean total cholesterol by age.
DATA Score;
DO Age = 20 TO 80 BY 1;
OUTPUT;
END;
RUN;
ODS EXCLUDE ALL;
PROC PLM RESTORE=SplineModel;
SCORE DATA=Score OUT=Score Predicted LCLM UCLM LCL UCL / ALPHA=0.03125;
RUN;
ODS EXCLUDE NONE;
DATA All;
SET Analysis Score(IN=S);
Score = S;
RUN;
PROC SGPLOT DATA=All NOAUTOLEGEND;
SCATTER X=Age Y=TotChol / MARKERATTRS=(SYMBOL=CircleFilled COLOR=LightBlue);
BAND X=Age Lower=LCL Upper=UCL / FILLATTRS=(COLOR=Red TRANSPARENCY=0.8);
BAND X=Age Lower=LCLM Upper=UCLM / FILLATTRS=(COLOR=Red TRANSPARENCY=0.6);
SERIES X=Age Y=Predicted / LINEATTRS=(COLOR=Red);
RUN;
PROC PRINT DATA=All NOOBS;
VAR Age Predicted LCL UCL LCLM UCLM;
WHERE Score = 1;
RUN;
| Age | Predicted | LCL | UCL | LCLM | UCLM |
|---|---|---|---|---|---|
| 20 | 4.51090 | 2.29027 | 6.73152 | 4.41459 | 4.60720 |
| 21 | 4.54396 | 2.32376 | 6.76415 | 4.45813 | 4.62979 |
| 22 | 4.57702 | 2.35718 | 6.79685 | 4.50107 | 4.65296 |
| 23 | 4.61007 | 2.39053 | 6.82962 | 4.54314 | 4.67701 |
| 24 | 4.64312 | 2.42380 | 6.86245 | 4.58391 | 4.70234 |
| 25 | 4.67616 | 2.45699 | 6.89533 | 4.62288 | 4.72944 |
| 26 | 4.70917 | 2.49009 | 6.92826 | 4.65965 | 4.75869 |
| 27 | 4.74216 | 2.52311 | 6.96121 | 4.69412 | 4.79020 |
| 28 | 4.77511 | 2.55604 | 6.99417 | 4.72656 | 4.82365 |
| 29 | 4.80801 | 2.58891 | 7.02712 | 4.75761 | 4.85841 |
| 30 | 4.84086 | 2.62170 | 7.06003 | 4.78798 | 4.89374 |
| 31 | 4.87366 | 2.65443 | 7.09288 | 4.81833 | 4.92898 |
| 32 | 4.90638 | 2.68711 | 7.12565 | 4.84915 | 4.96362 |
| 33 | 4.93904 | 2.71974 | 7.15833 | 4.88082 | 4.99726 |
| 34 | 4.97161 | 2.75232 | 7.19090 | 4.91357 | 5.02965 |
| 35 | 5.00406 | 2.78481 | 7.22332 | 4.94742 | 5.06071 |
| 36 | 5.03622 | 2.81702 | 7.25541 | 4.98187 | 5.09056 |
| 37 | 5.06787 | 2.84874 | 7.28700 | 5.01625 | 5.11949 |
| 38 | 5.09881 | 2.87974 | 7.31788 | 5.04982 | 5.14780 |
| 39 | 5.12884 | 2.90981 | 7.34787 | 5.08184 | 5.17583 |
| 40 | 5.15774 | 2.93872 | 7.37675 | 5.11166 | 5.20381 |
| 41 | 5.18530 | 2.96628 | 7.40432 | 5.13890 | 5.23171 |
| 42 | 5.21133 | 2.99228 | 7.43038 | 5.16346 | 5.25920 |
| 43 | 5.23561 | 3.01651 | 7.45471 | 5.18554 | 5.28568 |
| 44 | 5.25794 | 3.03878 | 7.47709 | 5.20546 | 5.31042 |
| 45 | 5.27810 | 3.05890 | 7.49730 | 5.22355 | 5.33265 |
| 46 | 5.29590 | 3.07666 | 7.51513 | 5.24010 | 5.35169 |
| 47 | 5.31116 | 3.09192 | 7.53040 | 5.25528 | 5.36704 |
| 48 | 5.32392 | 3.10471 | 7.54313 | 5.26903 | 5.37881 |
| 49 | 5.33425 | 3.11508 | 7.55341 | 5.28116 | 5.38733 |
| 50 | 5.34221 | 3.12309 | 7.56132 | 5.29140 | 5.39302 |
| 51 | 5.34788 | 3.12882 | 7.56694 | 5.29942 | 5.39634 |
| 52 | 5.35133 | 3.13231 | 7.57035 | 5.30487 | 5.39779 |
| 53 | 5.35263 | 3.13364 | 7.57163 | 5.30743 | 5.39784 |
| 54 | 5.35186 | 3.13287 | 7.57085 | 5.30689 | 5.39682 |
| 55 | 5.34908 | 3.13007 | 7.56808 | 5.30329 | 5.39487 |
| 56 | 5.34436 | 3.12532 | 7.56340 | 5.29685 | 5.39187 |
| 57 | 5.33778 | 3.11869 | 7.55687 | 5.28800 | 5.38757 |
| 58 | 5.32941 | 3.11026 | 7.54856 | 5.27722 | 5.38160 |
| 59 | 5.31932 | 3.10012 | 7.53851 | 5.26498 | 5.37365 |
| 60 | 5.30759 | 3.08835 | 7.52683 | 5.25157 | 5.36360 |
| 61 | 5.29430 | 3.07503 | 7.51357 | 5.23714 | 5.35145 |
| 62 | 5.27953 | 3.06025 | 7.49882 | 5.22181 | 5.33726 |
| 63 | 5.26338 | 3.04409 | 7.48266 | 5.20562 | 5.32113 |
| 64 | 5.24591 | 3.02663 | 7.46518 | 5.18860 | 5.30321 |
| 65 | 5.22721 | 3.00796 | 7.44646 | 5.17074 | 5.28367 |
| 66 | 5.20736 | 2.98814 | 7.42659 | 5.15201 | 5.26272 |
| 67 | 5.18645 | 2.96726 | 7.40564 | 5.13232 | 5.24058 |
| 68 | 5.16456 | 2.94539 | 7.38372 | 5.11159 | 5.21752 |
| 69 | 5.14176 | 2.92262 | 7.36090 | 5.08969 | 5.19383 |
| 70 | 5.11814 | 2.89901 | 7.33728 | 5.06648 | 5.16980 |
| 71 | 5.09379 | 2.87465 | 7.31293 | 5.04183 | 5.14574 |
| 72 | 5.06878 | 2.84961 | 7.28795 | 5.01566 | 5.12190 |
| 73 | 5.04320 | 2.82398 | 7.26242 | 4.98792 | 5.09847 |
| 74 | 5.01712 | 2.79782 | 7.23642 | 4.95868 | 5.07556 |
| 75 | 4.99064 | 2.77122 | 7.21005 | 4.92807 | 5.05321 |
| 76 | 4.96382 | 2.74426 | 7.18338 | 4.89626 | 5.03139 |
| 77 | 4.93676 | 2.71702 | 7.15651 | 4.86346 | 5.01007 |
| 78 | 4.90954 | 2.68958 | 7.12950 | 4.82990 | 4.98918 |
| 79 | 4.88224 | 2.66202 | 7.10245 | 4.79578 | 4.96869 |
| 80 | 4.85492 | 2.63441 | 7.07543 | 4.76130 | 4.94854 |
Recommended Practice
To illustrate the recommended practice for linear regression, we will consider a linear regression model for total cholesterol as a function of age and gender.
LIBNAME NHANES "../../../BMI552_Datasets/NHANES_May2023/";
*LIBNAME NHANES "~/my_shared_file_links/u48718377/NHANES";
OPTIONS FMTSEARCH=(NHANES.formats_NHANES work) NODATE NONUMBER;
TITLE1; TITLE2;
ODS NOPROCTITLE;
ODS GRAPHICS / LOESSMAXOBS=20000;
DATA Analysis;
SET NHANES.NHANES;
RUN;
ODS EXCLUDE ALL;
PROC GLMSELECT DATA=Analysis;
CLASS Gender;
EFFECT AgeCS = Spline(Age /
NATURALCUBIC BASIS=TPF(NOINT)
KNOTMETHOD=PERCENTILELIST(5 27.5 50 72.5 95));
MODEL TotChol = Gender AgeCS / SELECTION=NONE;
OUTPUT OUT=Fitted P=Pred R=Resid;
STORE Initial;
RUN;
ODS EXCLUDE NONE;
To assess the correct structural model assumption, we can examine plots of the residuals against the predictor variables. In this case, we could look at plots of the residuals against age, either overall or by gender. We can supplement these plots with formal model comparisons using BIC.
PROC SGPLOT DATA=Fitted;
REFLINE 0 / AXIS=Y;
LOESS X=Age Y=Resid / LINEATTRS=(COLOR="Red");
RUN;
TITLE "Female";
PROC SGPLOT DATA=Fitted;
REFLINE 0 / AXIS=Y;
LOESS X=Age Y=Resid / LINEATTRS=(COLOR="Red");
WHERE GENDER = 1;
RUN;
TITLE;
TITLE "Male";
PROC SGPLOT DATA=Fitted;
REFLINE 0 / AXIS=Y;
LOESS X=Age Y=Resid / LINEATTRS=(COLOR="Red");
WHERE GENDER = 2;
RUN;
TITLE;
TITLE "Initial";
PROC GLMSELECT DATA=Analysis;
ODS SELECT FitStatistics;
CLASS Gender;
EFFECT AgeCS = Spline(Age /
NATURALCUBIC BASIS=TPF(NOINT)
KNOTMETHOD=PERCENTILELIST(5 27.5 50 72.5 95));
MODEL TotChol = Gender AgeCS / SELECTION=NONE;
RUN;
TITLE;
TITLE "Initial + G*A";
PROC GLMSELECT DATA=Analysis;
ODS SELECT FitStatistics;
CLASS Gender;
EFFECT AgeCS = Spline(Age /
NATURALCUBIC BASIS=TPF(NOINT)
KNOTMETHOD=PERCENTILELIST(5 27.5 50 72.5 95));
MODEL TotChol = Gender AgeCS Gender*AgeCS / SELECTION=NONE;
RUN;
TITLE;
TITLE "Initial + 1df for Age";
PROC GLMSELECT DATA=Analysis;
ODS SELECT FitStatistics;
CLASS Gender;
EFFECT AgeCS = Spline(Age /
NATURALCUBIC BASIS=TPF(NOINT)
KNOTMETHOD=PERCENTILELIST(5 23 41 59 77 95));
MODEL TotChol = Gender AgeCS / SELECTION=NONE;
RUN;
TITLE;
Least Squares Model (No Selection)
| Root MSE | 1.02643 |
|---|---|
| Dependent Mean | 5.07162 |
| R-Square | 0.0631 |
| Adj R-Sq | 0.0625 |
| AIC | 7620.42604 |
| AICC | 7620.44153 |
| SBC | 424.74615 |
Least Squares Model (No Selection)
| Root MSE | 1.01477 |
|---|---|
| Dependent Mean | 5.07162 |
| R-Square | 0.0848 |
| Adj R-Sq | 0.0837 |
| AIC | 7459.09634 |
| AICC | 7459.13289 |
| SBC | 290.96319 |
Least Squares Model (No Selection)
| Root MSE | 1.02649 |
|---|---|
| Dependent Mean | 5.07162 |
| R-Square | 0.0632 |
| Adj R-Sq | 0.0624 |
| AIC | 7622.29297 |
| AICC | 7622.31290 |
| SBC | 433.49977 |
Based on BIC, the model with the interaction term fits decisively better.
ODS EXCLUDE ALL;
PROC GLMSELECT DATA=Analysis;
CLASS Gender;
EFFECT AgeCS = Spline(Age /
NATURALCUBIC BASIS=TPF(NOINT)
KNOTMETHOD=PERCENTILELIST(5 27.5 50 72.5 95));
MODEL TotChol = Gender AgeCS Gender*AgeCS / SELECTION=NONE;
OUTPUT OUT=Fitted P=Pred R=Resid;
STORE Revised;
RUN;
ODS EXCLUDE NONE;
To assess the correct structural model assumption, we can again examine plots of the residuals against the predictor variables. In this case, we could look at plots of the residuals against age, either overall or by gender. We can supplement these plots with formal model comparisons using BIC.
PROC SGPLOT DATA=Fitted;
REFLINE 0 / AXIS=Y;
LOESS X=Age Y=Resid / LINEATTRS=(COLOR="Red");
RUN;
TITLE "Female";
PROC SGPLOT DATA=Fitted;
REFLINE 0 / AXIS=Y;
LOESS X=Age Y=Resid / LINEATTRS=(COLOR="Red");
WHERE GENDER = 1;
RUN;
TITLE;
TITLE "Male";
PROC SGPLOT DATA=Fitted;
REFLINE 0 / AXIS=Y;
LOESS X=Age Y=Resid / LINEATTRS=(COLOR="Red");
WHERE GENDER = 2;
RUN;
TITLE;
TITLE "Revised";
PROC GLMSELECT DATA=Analysis;
ODS SELECT FitStatistics;
CLASS Gender;
EFFECT AgeCS = Spline(Age /
NATURALCUBIC BASIS=TPF(NOINT)
KNOTMETHOD=PERCENTILELIST(5 27.5 50 72.5 95));
MODEL TotChol = Gender AgeCS Gender*AgeCS / SELECTION=NONE;
RUN;
TITLE;
TITLE "Revised + 1df for Age";
PROC GLMSELECT DATA=Analysis;
ODS SELECT FitStatistics;
CLASS Gender;
EFFECT AgeCS = Spline(Age /
NATURALCUBIC BASIS=TPF(NOINT)
KNOTMETHOD=PERCENTILELIST(5 23 41 59 77 95));
MODEL TotChol = Gender AgeCS Gender*AgeCS / SELECTION=NONE;
RUN;
TITLE;
Least Squares Model (No Selection)
| Root MSE | 1.01477 |
|---|---|
| Dependent Mean | 5.07162 |
| R-Square | 0.0848 |
| Adj R-Sq | 0.0837 |
| AIC | 7459.09634 |
| AICC | 7459.13289 |
| SBC | 290.96319 |
Least Squares Model (No Selection)
| Root MSE | 1.01479 |
|---|---|
| Dependent Mean | 5.07162 |
| R-Square | 0.0850 |
| Adj R-Sq | 0.0836 |
| AIC | 7461.45391 |
| AICC | 7461.50432 |
| SBC | 307.09414 |
There do not appear to be concerns about the adequacy of the correct structural model assumption for the revised model.
The independence assumption cannot be assessed without knowledge of a putative source of dependence.
The constant (correct) variance assumption can be assessed with a plot of the square root of the absolute values of the residuals against the fitted values.
DATA Fitted;
SET Fitted;
Sqrt_Abs_Resid = SQRT(ABS(Resid));
RUN;
PROC SGPLOT DATA=Fitted;
LOESS X=Pred Y=Sqrt_Abs_Resid / LINEATTRS=(COLOR="Red");
REG X=Pred Y=Sqrt_Abs_Resid / NOMARKERS LINEATTRS=(COLOR="Purple");
RUN;
There are no concerns with the adequacy of the constant variance assumption.
The normality assumption can be assessed with a normal scores plot of the residuals.
PROC UNIVARIATE DATA=Fitted NOPRINT;
QQPLOT Resid / NORMAL(MU=EST SIGMA=EST);
RUN;
The distribution of the residuals is right skewed (not normal), but, given the large sample size and our lack of interest in individual level prediction, the violation of the normality assumption is not problematic.
We can perform F-tests for the the interaction, the overall age effect, and non-linearity of the age effect.
ODS EXCLUDE ALL;
TITLE "Revised Model";
PROC GLMSELECT DATA=Analysis;
ODS SELECT ANOVA;
CLASS Gender;
EFFECT AgeCS = Spline(Age /
NATURALCUBIC BASIS=TPF(NOINT)
KNOTMETHOD=PERCENTILELIST(5 27.5 50 72.5 95));
MODEL TotChol = Gender AgeCS Gender*AgeCS / SELECTION=NONE;
RUN;
TITLE "Additive Model";
PROC GLMSELECT DATA=Analysis;
ODS SELECT ANOVA;
CLASS Gender;
EFFECT AgeCS = Spline(Age /
NATURALCUBIC BASIS=TPF(NOINT)
KNOTMETHOD=PERCENTILELIST(5 27.5 50 72.5 95));
MODEL TotChol = Gender AgeCS / SELECTION=NONE;
RUN;
TITLE "Linear Age";
PROC GLMSELECT DATA=Analysis;
ODS SELECT ANOVA;
CLASS Gender;
MODEL TotChol = Gender Age Gender*Age / SELECTION=NONE;
RUN;
TITLE "No Age";
PROC GLMSELECT DATA=Analysis;
ODS SELECT ANOVA;
CLASS Gender;
MODEL TotChol = Gender / SELECTION=NONE;
RUN;
ODS EXCLUDE NONE;
Least Squares Model (No Selection)
| Analysis of Variance | |||||
|---|---|---|---|---|---|
| Source | DF |
Sum of Squares |
Mean Square |
F Value | Pr > F |
| Model | 9 | 689.42182 | 76.60242 | 74.39 | <.0001 |
| Error | 7225 | 7439.94539 | 1.02975 | ||
| Corrected Total | 7234 | 8129.36721 | |||
Least Squares Model (No Selection)
| Analysis of Variance | |||||
|---|---|---|---|---|---|
| Source | DF |
Sum of Squares |
Mean Square |
F Value | Pr > F |
| Model | 5 | 513.24189 | 102.64838 | 97.43 | <.0001 |
| Error | 7229 | 7616.12533 | 1.05355 | ||
| Corrected Total | 7234 | 8129.36721 | |||
Least Squares Model (No Selection)
| Analysis of Variance | |||||
|---|---|---|---|---|---|
| Source | DF |
Sum of Squares |
Mean Square |
F Value | Pr > F |
| Model | 3 | 276.51275 | 92.17092 | 84.87 | <.0001 |
| Error | 7231 | 7852.85446 | 1.08600 | ||
| Corrected Total | 7234 | 8129.36721 | |||
Least Squares Model (No Selection)
| Analysis of Variance | |||||
|---|---|---|---|---|---|
| Source | DF |
Sum of Squares |
Mean Square |
F Value | Pr > F |
| Model | 1 | 45.48721 | 45.48721 | 40.70 | <.0001 |
| Error | 7233 | 8083.88000 | 1.11764 | ||
| Corrected Total | 7234 | 8129.36721 | |||
ODS EXCLUDE ALL;
DATA FTests;
LENGTH Test $50;
SS_Full = 7439.94539;
DF_Full = 7225;
MS_Full = SS_Full/DF_Full;
Test = "G*A";
SS_Reduced = 7616.12533;
DF_Reduced = 7229;
OUTPUT;
Test = "Non-Linear Age";
SS_Reduced = 7852.85446;
DF_Reduced = 7231;
OUTPUT;
Test = "Age Overall";
SS_Reduced = 8129.36721;
DF_Reduced = 7233;
OUTPUT;
RUN;
DATA FTests;
SET FTests;
SS = SS_Reduced - SS_Full;
DF = DF_Reduced - DF_Full;
MS = SS/DF;
F = MS/MS_Full;
ProbF = EXP(LOGSDF("F",F,DF,DF_Full));
sValue = -LOG(ProbF)/LOG(2);
KEEP Test DF MS F ProbF sValue;
RUN;
ODS EXCLUDE NONE;
TITLE "F-Tests";
PROC PRINT DATA=FTests NOOBS;
VAR Test DF MS F ProbF sValue;
FORMAT sValue 6.2;
RUN;
TITLE;
| Test | DF | MS | F | ProbF | sValue |
|---|---|---|---|---|---|
| G*A | 4 | 44.0450 | 42.7725 | 1.61636E-35 | 115.57 |
| Non-Linear Age | 6 | 68.8182 | 66.8300 | 3.30846E-81 | 267.35 |
| Age Overall | 8 | 86.1777 | 83.6880 | 4.4762E-133 | 439.65 |
There is overwhelming evidence for effects of age that are non-linear and vary by gender.
Estimated mean difference in total cholesterol between age can be obtained using ESTIMATE statements.
ODS EXCLUDE ALL;
PROC PLM RESTORE=Revised;
ESTIMATE "Age 45 v 35, Female" AgeCS [1, 45] [-1, 35] Gender*AgeCS [1, 1 45] [-1, 1 35]/ CL ALPHA=0.03125;
ESTIMATE "Age 55 v 35, Female" AgeCS [1, 55] [-1, 35] Gender*AgeCS [1, 1 55] [-1, 1 35]/ CL ALPHA=0.03125;
ESTIMATE "Age 65 v 35, Female" AgeCS [1, 65] [-1, 35] Gender*AgeCS [1, 1 65] [-1, 1 35]/ CL ALPHA=0.03125;
ESTIMATE "Age 45 v 35, Male" AgeCS [1, 45] [-1, 35] Gender*AgeCS [1, 2 45] [-1, 2 35]/ CL ALPHA=0.03125;
ESTIMATE "Age 55 v 35, Male" AgeCS [1, 55] [-1, 35] Gender*AgeCS [1, 2 55] [-1, 2 35]/ CL ALPHA=0.03125;
ESTIMATE "Age 65 v 35, Male" AgeCS [1, 65] [-1, 35] Gender*AgeCS [1, 2 65] [-1, 2 35]/ CL ALPHA=0.03125;
ODS OUTPUT Estimates=E;
RUN;
ODS EXCLUDE NONE;
PROC PRINT DATA=E NOOBS LABEL;
VAR Label Lower Estimate Upper;
LABEL Label='00'x;
FORMAT Estimate 6.3 Lower 6.3 Upper 6.3;
RUN;
| Lower | Estimate | Upper | |
|---|---|---|---|
| Age 45 v 35, Female | 0.222 | 0.327 | 0.432 |
| Age 55 v 35, Female | 0.497 | 0.605 | 0.712 |
| Age 65 v 35, Female | 0.496 | 0.603 | 0.710 |
| Age 45 v 35, Male | 0.125 | 0.230 | 0.336 |
| Age 55 v 35, Male | -0.008 | 0.097 | 0.203 |
| Age 65 v 35, Male | -0.287 | -0.180 | -0.073 |